1.  Introduction


Under the auspices of a DoD High Performance Computing Modernization Program (HPCMP) Capability
Applications Project (CAP), researchers at the U.S. Army Research Laboratory (ARL) evaluated the
performance of the CTH shock physics code[1] on the Opteron cluster recently installed at the ARL Major
Shared Resource Center (MSRC). This system has 2304 processors for batch processing, each running at a
clock speed of 2.2 GHz. Scalability trials were conducted using up to 2048 processors and involved the
simulation of the yawed, oblique impact of a long rod penetrator with a thin plate. This case has been
used in the past as a standard benchmark in assessing the scalability of CTH on many other scalable
systems deployed by HPCMP[2, 3, 4, 5, 6, 7, 8]. The scalability of CTH on the Opteron cluster was
studied for both fixed and adaptive meshes.

After  the  scalability  study  was  completed,  CTH  simulations  were  conducted  to  evaluate  the  potential  to 
use  shock  physics  simulations  to  augment  experimental  data  in  behind  armor  debris  applications.  These 
simulations were conducted for both fixed and adaptive meshes using 512 to 2048 processors. A variation
of  a  fracture  model  currently  under  development  at  ARL  was  also  evaluated. 

This  paper  describes  the  scalability  of  CTH  on  the  Opteron  cluster  and  the  results  of  a  set  of  simulations  to 
model  the  formation  and  evolution  of  behind  armor  debris  fields. 


2.  Scalability  Study 


The  scalability  of  CTH  on  an  Opteron  cluster  (Stryker)  was  determined  through  a  series  of  simulations 
that  employed  both  fixed  and  adaptive  meshes.  The  fixed-mesh  scalability  simulations  were  conducted 
with  a  nearly  constant  workload.  This  was  done  to  keep  the  computation-to-communication  ratio  as  close 
to  constant  as  possible  for  simulations  involving  different  numbers  of  processors.  Maintaining  a  nearly 
constant  computation-to-communication  ratio  and  minimizing  disk  access  for  intermediate  plot  and  restart 
files  during  the  time  integration  permitted  the  computational  performance  to  be  isolated  and  measured  as 
a  function  of  the  number  of  processors  used. 

As the number of processors was increased, the fixed mesh was incrementally refined by uniformly
decreasing the characteristic cell size in each coordinate direction by a factor of approximately
$2^{-1/3}$, with the cell counts in each direction rounded to integers. This approach approximately
doubles the total number of Eulerian cells with each successive mesh refinement. The characteristics
of the meshes used in the scalability study are summarized in Table 1. In this table, the columns NI,
NJ, and NK refer to the number of Eulerian cells in the x, y, and z directions, respectively. The mesh
sizes listed in the table produce computational sub-domains containing approximately 387,000 Eulerian
cells each. For the 2048-processor simulation, this results in a computational domain containing
approximately 800 million Eulerian cells.
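
The refinement rule above can be sketched in a few lines of Python. This is only an illustration of the stated rule, assuming the 1-processor mesh of Table 1 as the starting point and simple rounding of the cell counts; the function name `scaled_mesh` is hypothetical, and the published Table 1 values differ from this simple rounding by a few cells in places.

```python
# Sketch of the fixed-mesh scaling rule: the cell size shrinks by 2**(-1/3)
# for each doubling of processors, so the total cell count roughly doubles.
BASE_DIMS = (215, 30, 60)   # NI, NJ, NK of the 1-processor mesh (Table 1)
BASE_CELL_MM = 1.0          # cell size of the 1-processor mesh (Table 1)


def scaled_mesh(processors):
    """Return (NI, NJ, NK, total_cells, cell_size_mm) for a processor count.

    Assumption: cell counts are obtained by plain rounding; the paper does
    not state the exact rounding used.
    """
    factor = processors ** (1.0 / 3.0)   # linear refinement factor
    ni, nj, nk = (round(n * factor) for n in BASE_DIMS)
    return ni, nj, nk, ni * nj * nk, BASE_CELL_MM / factor


if __name__ == "__main__":
    for p in (1, 8, 64, 512, 2048):
        ni, nj, nk, total, size = scaled_mesh(p)
        print(f"{p:5d} procs: {ni} x {nj} x {nk} = {total:,} cells, "
              f"{size:.4f} mm, {total / p:,.0f} cells per processor")
```

Because the linear factor is the cube root of the processor count, the number of cells per processor stays close to 387,000 at every scale, which is what keeps the workload per processor nearly constant.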

The scalable performance of the message-passing code is measured by the "grind time," which is the
average processor time required for the code to update all field variables for one computational cell
in a given time increment (cycle). In a case of ideal scalability, the grind time will decrease by a
factor of two for every doubling of processors used if the ratio of computation to communication is
held constant.
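
As a rough illustration of this metric, the sketch below computes a grind time and compares it with the ideal value. It assumes the grind time is the elapsed time divided by the product of total cells and cycles, and it defines parallel efficiency as the ratio of ideal to measured grind time; the function names are hypothetical, and the paper does not state that this is the exact formula used by CTH.

```python
def grind_time(elapsed_seconds, total_cells, cycles):
    """Time spent per cell per cycle (assumed definition, in seconds)."""
    return elapsed_seconds / (total_cells * cycles)


def ideal_grind_time(single_proc_grind, processors):
    """Ideal scaling: the grind time halves with every doubling of processors."""
    return single_proc_grind / processors


def parallel_efficiency(single_proc_grind, measured_grind, processors):
    """Ratio of the ideal grind time to the measured one (1.0 means ideal)."""
    return ideal_grind_time(single_proc_grind, processors) / measured_grind
```

For example, with purely hypothetical numbers, a single-processor grind time of 8 units and a 2-processor grind time of 5 units would give `parallel_efficiency(8.0, 5.0, 2)` of 0.8.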

Table 1.  Fixed-mesh CTH scalability study parameters.

 Number of
Processors      NI      NJ      NK    Total Cells    Cell Size (mm)
----------    ----    ----    ----    -----------    --------------
         1     215      30      60        387,000            1.0000
         2     271      38      75        772,350            0.7934
         4     341      48      95      1,554,960            0.6305
         8     430      60     120      3,096,000            0.5000
        16     541      76     151      6,208,516            0.3974
        32     683      95     191     12,393,035            0.3148
        64     860     120     240     24,768,000            0.2500
       128    1083     151     302     49,386,966            0.1985
       256    1366     190     382     99,144,280            0.1574
       512    1720     240     480    198,144,000            0.1250
      1024    2166     302     604    395,095,728            0.0993
      2048    2732     380     764    793,154,240            0.0787


The  results  of  the  fixed-mesh  CTH  scalability  study  are  presented  in  Figure  1.  This  figure  compares  the 
performance  of  Stryker  to  that  of  two  other  ARL  MSRC  clusters:  the  32-processor  prototype  Opteron  cluster 
(Cage)  and  the  256-processor,  3.06-GHz  Xeon  cluster  (Powell).  Each  of  these  clusters  has  two  processors  per 
node.  Simulations  were  performed  using  both  1  and  2  CTH  tasks  per  node. 


Figure  1.  Scalable  performance  of  fixed-mesh  CTH  on  Opteron  &  Xeon  clusters. 

The fixed-mesh CTH scalability results show that the performance on Stryker is approximately the same as
that on Cage, the prototype system. The 2-task/node performance on the Opteron systems was almost the
same as the 1-task/node performance. The same is not true for the Xeon-based system (Powell), on which
the 2-task/node performance is lower than the 1-task/node performance. Linear scalability was obtained
on Stryker for all simulations up to the largest case using 2048 processors. The orange line in Figure 1
is the result of a regression analysis of the data from the 1- and 2-task/node runs on Stryker, which
yielded a parallel efficiency of approximately 79%.

An  adaptive  mesh  refinement  (AMR)  capability  has  been  added  to  CTH  which  allows  the  definition  of 
the  mesh  to  change  during  the  simulation  based  on  the  evolving  characteristics  of  the  simulation[9].  The 
adaptation  of  the  mesh  is  based  on  user-defined  indicators,  such  as  the  value,  gradient,  or  difference,  of 
a variable in the solution (pressure, density, velocity, stress, etc.). This technique results in simulations in
which  the  most  highly  resolved  mesh  "follows"  the  activity  of  interest  to  the  analyst  while  using  less  highly 
resolved  mesh  in  the  remainder  of  the  computational  domain.  This  allows  the  analyst  to  configure  highly 
resolved  simulations  that  have  fewer  total  computational  cells  than  a  comparable  fixed-mesh  simulation 
having  the  same  minimum  cell  size. 

The AMR implementation in CTH is a block-based scheme in which each block consists of an orthogonal
mesh with a fixed number of cells in the x, y, and z directions. The blocks are connected in a
hierarchical manner with adjacent blocks having either exactly the same cell size or exactly a 2:1 ratio
in cell size. Refinement or un-refinement of the mesh is accomplished through a series of transitions of
adjacent blocks with a difference in mesh density of 2:1. All mesh blocks at a given mesh density are at
the same refinement level. The finest mesh resolution that can exist in the computational domain is
controlled by defining the maximum refinement level of the mesh.

The AMR CTH benchmark used in the scalability study was configured to be physically identical to the
fixed-mesh simulation. The only difference between the fixed-mesh simulation and the AMR simulation
was the definition of the mesh. The size of the mesh in the AMR simulation was scaled with the number
of processors in a manner similar to the fixed-mesh study. However, it is not possible to precisely
scale the total number of cells in the AMR simulation since the refinement and un-refinement indicators
are based on the physics, not the topology of the computational domain. Thus, to scale the size of the
simulation in a controlled manner, the maximum refinement level was increased by one for every factor
of eight increase in the number of processors. The 2:1 ratio of cell size between refinement levels
increases the total number of cells in the 3-D simulation by a factor of approximately eight. The
variation of the maximum refinement level and the resulting minimum cell size with the number of
processors used is summarized in Table 2.

Table 2.  AMR CTH scalability study parameters.

 Number of       Maximum         Minimum Cell
Processors    Refinement Level     Size (mm)
----------    ----------------   ------------
         1            4              1.875
         2            4              1.875
         4            4              1.875
         8            5              0.938
        16            5              0.938
        32            5              0.938
        64            6              0.469
       128            6              0.469
       256            6              0.469
       512            7              0.234
      1024            7              0.234
      2048            7              0.234
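
The pattern in Table 2 follows directly from the stated rule of one extra refinement level per factor of eight in processor count, with the cell size halving at each level. The short sketch below reproduces it, taking the 1-processor row of Table 2 as the reference; the function names are hypothetical and this is not CTH input syntax.

```python
BASE_LEVEL = 4          # maximum refinement level at 1 processor (Table 2)
BASE_CELL_MM = 1.875    # minimum cell size at that level (Table 2)


def max_refinement_level(processors):
    """One additional level for every factor-of-eight increase in processors."""
    # bit_length() - 1 is floor(log2(processors)); dividing by 3 gives floor(log8).
    return BASE_LEVEL + (processors.bit_length() - 1) // 3


def min_cell_size_mm(processors):
    """Each additional refinement level halves the minimum cell size (2:1 ratio)."""
    return BASE_CELL_MM / 2 ** (max_refinement_level(processors) - BASE_LEVEL)


if __name__ == "__main__":
    for p in (1, 8, 64, 512, 2048):
        print(p, max_refinement_level(p), min_cell_size_mm(p))
```

The unrounded sizes are 1.875, 0.9375, 0.46875, and 0.234375 mm, which match the tabulated values to three decimal places.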

The results of the AMR CTH scalability study are provided in Figure 2. This figure compares the grind
time vs. number of processors used for Stryker and Powell (an AMR scalability study was not conducted on
Cage). The results of the AMR study show the same trends as the fixed-mesh study, with Stryker
demonstrating a parallel efficiency of approximately 80%. On the Xeon-based system, there is a clear
difference between the 1- and 2-task/node performance, while there is not a noticeable difference on the
Opteron-based system. As in the fixed-mesh study, linear scalability was achieved for all simulations up
to the maximum of 2048 processors in the study.

Figure 2.  Scalable performance of AMR CTH on Opteron & Xeon clusters.


3.  Behind  Armor  Debris  Study 


Behind armor debris is a major cause of damage in military vehicles that have been perforated by a
penetrator, bullet, or fragment. The ability to predict the debris field resulting from attack by such a
threat is critical to assessing and improving the survivability of tactical systems. The ARL
Survivability and Lethality Analysis Directorate (SLAD) has the mission of providing such assessments to
vehicle designers. The ARL Weapons and Materials Research Directorate (WMRD) has been working to develop
the capability to model numerically the behind armor debris resulting from armor perforation, for
application to the SLAD mission.

Modeling of the debris field historically has been done by statistically analyzing data from carefully
controlled experiments. The difficulty of collecting this information makes it an expensive and lengthy
process. Supplementing these experiments with numerical simulations is a natural synergy, but has not
yet been successfully exploited because previous computer systems were unable to cope with the daunting
size of the simulations.

With the addition of the Opteron cluster to the ARL MSRC, numerical modeling of these experiments is
now within reach. The experiment modeled as a demonstration of the technique consists of a 30-mm Armor
Piercing Discarding Sabot (APDS) round perforating a 1-inch-thick armor steel plate. The resulting
behind armor debris impacts a large (610-mm x 610-mm) [2-ft x 2-ft], thin (0.8-mm) [1/32-inch] mild steel
witness plate placed 610 mm behind the armor. Perforations made in the witness plate by the debris are
measured, and conclusions are drawn about the size, mass, spatial distribution, and velocity of the
debris field. This is painstaking work, but it results in a reasonably accurate characterization of the
debris field.

The difficulty in modeling this experiment arises primarily from two factors. First, the experiment is
inherently three-dimensional in nature due to the random distribution of failure in the plate. Thus any
simulation of the experiment must be done in three dimensions (3-D). Second, the wide range of length
scales requires a fine mesh resolution to resolve the debris field that, when extended over the 610-mm
air space and large area of the witness plate, requires approximately one-half billion cells for a
relatively coarse resolution (two cells through the thickness of the witness plate). Compounding the
problem, the small cell size requires a small integration time-step, so a huge number of computation
cycles is required to traverse the debris through the 610-mm air space.

The result of modeling such an experiment with CTH can be seen in Figure 3a, which shows the state at
600 µs after impact of the 30-mm APDS round on the armor plate, when the debris field is well developed
but has not yet impacted the witness plate. This 3-D simulation ran on the Opteron cluster using 2048
processors, or 9 teraflops of compute power. Running the simulation to 1.2 ms, when most debris has
perforated the witness plate, required five days, the equivalent of over 28 processor-years. The
simulation required more than 1 TB of memory and wrote over 300 GB of field data to disk. A few years
ago, running this simulation would not have been possible.


Figure 3.  CTH simulations of behind armor debris experiment, 600 µs after impact: a. fixed mesh, b. AMR.

It is possible to reduce the size of this simulation by employing the AMR technique in CTH. When AMR is
employed, the mesh is refined only in regions of interest. As a result, the large empty areas of this
simulation are coarsely resolved, and mesh refinement follows the fragments in the debris field as they
fly toward the witness plate. The state of the AMR simulation at 600 µs is shown in Figure 3b. This 3-D
simulation ran for 54 hours on 512 processors using a total of 0.5 TB of memory. This is still a very
large simulation, but it used roughly one-tenth the processor-hours of the fixed-mesh simulation in
Figure 3a.
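
The cost comparison between the two runs is simple arithmetic on the figures quoted in the text. The sketch below works it through, assuming the "five days" of the fixed-mesh run means exactly 120 hours and a 365-day year; the helper name `processor_hours` is hypothetical.

```python
HOURS_PER_DAY = 24
HOURS_PER_YEAR = 365 * HOURS_PER_DAY


def processor_hours(processors, wall_hours):
    """Total processor-hours consumed by a run."""
    return processors * wall_hours


# Fixed-mesh run: 2048 processors for five days (assumed to be exactly 120 hours).
fixed = processor_hours(2048, 5 * HOURS_PER_DAY)
# AMR run: 512 processors for 54 hours.
amr = processor_hours(512, 54)

if __name__ == "__main__":
    print(f"fixed mesh: {fixed:,} processor-hours "
          f"({fixed / HOURS_PER_YEAR:.1f} processor-years)")
    print(f"AMR:        {amr:,} processor-hours "
          f"({amr / fixed:.3f} of the fixed-mesh cost)")
```

Under these assumptions the fixed-mesh run comes to 245,760 processor-hours, or just over 28 processor-years, and the AMR run to 27,648 processor-hours, a ratio of about 0.11.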

One  objective  of  the  current  work  was  to  verify  that  the  AMR  CTH  simulation  will  produce  the  same 
result  as  the  fixed-mesh  simulation.  The  work  showed  that  mesh  resolution  in  CTH  has  an  impact  on  the 
predicted  fragmentation,  and  must  be  carefully  controlled.  A  difference  between  the  finest  resolution  of 
target  material  in  the  AMR  simulation  (Figure  3b)  and  the  constant  resolution  of  the  fixed-mesh  simulation 
(Figure 3a) contributed to the differences seen in the debris field. When the resolution is consistent, AMR CTH was
found to be an accurate and computationally effective substitute for the fixed-mesh case. Another objective
of  this  work  was  to  verify  the  ability  to  run  CTH  on  large-scale  clusters  to  efficiently  conduct  extremely 
large  computations  on  a  large  number  of  processors.  This  work  provided  a  realistic  test  that  demonstrated 
scalability. 

In  the  second  part  of  the  behind  armor  debris  study,  a  fracture  model  was  modified  to  improve  the  CTH 
prediction.  Researchers  at  the  Lawrence  Livermore  National  Laboratory  (LLNL)[10]  have  demonstrated 
with  a  Lagrangian  code  the  effectiveness  of  providing  a  statistical  distribution  of  fracture  properties  in 
simulations.  Here,  the  technique  is  incorporated  into  the  Eulerian  code  CTH  and  applied  to  modeling 
this  ballistic  experiment.  In  a  conventional  CTH  simulation,  all  cells  containing  target  material  have  the 
same  set  of  fracture  model  parameters,  so  all  fail  in  the  same  way.  This  effect  is  shown  graphically  in 
Figure  4a,  which  shows  the  bulge  on  the  rear  of  the  target  plate  just  prior  to  the  penetrator  breaking  through. 
Damage  is  shown  in  this  figure  by  coloring;  blue  is  no  damage,  red  is  fully  damaged.  Notice  the  uniformity 
and  symmetry  of  the  damage  in  the  bulge.  The  new  model  installed  in  CTH  provides  a  spatially  random 
distribution  of  values  for  the  principal  fracture  model  parameter,  although  in  the  aggregate  its  population  is 
Weibull-distributed.  This  causes  non-uniform,  stochastic  failure  of  the  armor  plate,  as  shown  in  Figure  4b. 
The  resultant  behind  armor  debris  field  is  strongly  dependent  on  the  nature  of  the  Weibull  distribution  of 
the  fracture  parameter,  as  quantified  by  the  Weibull  modulus,  m,  which  is  a  user-supplied  input  to  CTH. 
As an analogy, think of the Weibull modulus as determining the standard deviation of the distribution of
the  fracture  model  parameter.  A  Weibull  modulus  of  m  =  2  provided  the  results  shown  in  Figure  4b.  As 
can  be  seen  by  comparing  Figures  4a  and  4b,  a  more  realistic  fragmentation  of  the  target  is  obtained  with 
the  distributed  fracture  parameter  approach. 
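
To make the role of the Weibull modulus concrete, the following sketch draws one fracture parameter per cell from a Weibull distribution and reports the relative spread for a given m. It is a minimal illustration using Python's standard library, not the CTH implementation: the function names, the use of a nominal value as the Weibull scale, and the treatment of m = 0 as "no statistical variation" (mirroring how the conventional run is labeled in the figures) are all assumptions.

```python
import math
import random


def weibull_cv(m):
    """Coefficient of variation (std/mean) of a Weibull distribution with modulus m > 0."""
    g1 = math.gamma(1.0 + 1.0 / m)
    g2 = math.gamma(1.0 + 2.0 / m)
    return math.sqrt(g2 - g1 * g1) / g1


def sample_fracture_parameters(n_cells, nominal, m, seed=0):
    """Draw one fracture parameter per cell, independently of cell position.

    Assumptions: `nominal` is used as the Weibull scale parameter, and m == 0
    means every cell gets the same nominal value (the conventional case).
    """
    if m == 0:
        return [nominal] * n_cells
    rng = random.Random(seed)
    # weibullvariate(alpha, beta): alpha is the scale, beta is the shape (modulus).
    return [rng.weibullvariate(nominal, m) for _ in range(n_cells)]


if __name__ == "__main__":
    for m in (2, 5, 10):
        print(f"m = {m:2d}: relative spread (std/mean) = {weibull_cv(m):.3f}")
```

A larger modulus gives a narrower distribution, so a small value such as m = 2 corresponds to a strongly heterogeneous material, with a standard deviation of roughly half the mean.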


Figure 4.  Bulge of the rear of the target plate showing damage just prior to break-out (60 µs after impact)
for Weibull modulus of: a. m = 0, b. m = 2.

In a third part of this work, Jerry Clarke of the ARL Computational and Information Sciences Directorate
(CISD) is developing software based on the Interdisciplinary Computing Environment (ICE) which will
automatically identify and quantify all contiguous bodies in a CTH calculation. This type of automatic
analysis was not previously possible with CTH calculations. Called FragFinder, this software identifies
regions (i.e., fragments) where the volume fraction for each material is above a certain threshold, and
determines the volume and velocity of these regions. FragFinder was used to analyze the debris field in
a conventional CTH simulation (m = 0) and in a CTH simulation using statistical fracture (m = 2). The
results are presented in Figure 5, where the volume determination has been converted to mass. This plot
shows the total number of fragments, from both the penetrator and the target, with a mass greater than or
equal to a given (abscissa) value. The open symbols show the experimental data. The dashed line shows
the result for standard CTH, which over-predicted the number of fragments, especially the number of small
fragments. The solid line shows the result of a simulation using the statistical model (with m = 2) for
the target only, a significantly improved result. Figure 5 indicates that with the proper choice of
Weibull modulus, and with the statistical model applied to both the target and the penetrator, a more
realistic debris field can be obtained than arises from the classic method of using a constant parameter.
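
The kind of analysis described for FragFinder can be illustrated with a generic connected-region search. The sketch below is not FragFinder's algorithm, which the text does not describe in detail; it assumes a single material, a small 3-D grid of volume fractions stored as nested lists, face-adjacent connectivity, and a hypothetical threshold of 0.5, and it omits the velocity calculation. All names are hypothetical.

```python
from collections import deque

# Offsets to the six face-adjacent neighbors of a cell.
NEIGHBORS = ((1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1))


def find_fragments(vf, threshold=0.5, cell_volume=1.0):
    """Return the volume of each contiguous region whose cells exceed the threshold."""
    nx, ny, nz = len(vf), len(vf[0]), len(vf[0][0])
    seen = set()
    volumes = []
    for i in range(nx):
        for j in range(ny):
            for k in range(nz):
                if (i, j, k) in seen or vf[i][j][k] <= threshold:
                    continue
                # Breadth-first flood fill from this unvisited material cell.
                queue = deque([(i, j, k)])
                seen.add((i, j, k))
                volume = 0.0
                while queue:
                    a, b, c = queue.popleft()
                    volume += vf[a][b][c] * cell_volume
                    for da, db, dc in NEIGHBORS:
                        n = (a + da, b + db, c + dc)
                        inside = 0 <= n[0] < nx and 0 <= n[1] < ny and 0 <= n[2] < nz
                        if inside and n not in seen and vf[n[0]][n[1]][n[2]] > threshold:
                            seen.add(n)
                            queue.append(n)
                volumes.append(volume)
    return volumes


def cumulative_count(masses, min_mass):
    """Number of fragments with mass greater than or equal to min_mass."""
    return sum(1 for m in masses if m >= min_mass)
```

Multiplying each fragment volume by the material density gives its mass, and evaluating `cumulative_count` over a range of masses produces a cumulative distribution of the kind plotted in Figure 5.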

Figure 5.  CTH prediction of mass distribution of fragments compared to experimental results.

4.  Summary 

The  linear  scalability  of  CTH  on  the  ARL  MSRC  Opteron  cluster  has  been  demonstrated  for  simulations 
using  up  to  2048  processors.  The  linear  scalability  was  demonstrated  for  simulations  using  both  fixed  and 
adaptive  meshes.  As  a  result,  the  general  efficacy  of  large  scale  weapons  effects  simulations  on  scalable 
systems  has  been  demonstrated. 

The  work  described  herein  has  also  shown  that  numerical  simulation  of  behind  armor  debris  is  now  within 
the  ability  of  current  MSRC  resources,  and  simulations  can  be  successfully  exploited  to  supplement  the 
expensive  experiments.  Furthermore,  the  new  capabilities  of  statistical  fracture  and  automatic  fragment 
quantification  make  the  technique  more  useful. 